Motivation
- Adding a conjugate prior over θ reduces the number of parameters from pLSA's MK + KV to K + KV, which is independent of the number of documents and makes the model less prone to overfitting.
- In pLSA, besides the β matrix, we learn M document-specific points for θ; in LDA we learn a single Dirichlet over θ, so the trained model generalizes easily to unseen documents.
Specification
original version:
smoothed version:
Parameters for original version:
- K: number of topics
- M: number of documents
- Nm: length of m-th document
- V: size of vocabulary
- α: Dirichlet prior
- β: K×V, word distribution of topics
- θ: M×K, topic distribution of documents
Parameters for smoothed version:
$$
\begin{aligned}
\alpha &\sim \text{a Dirichlet hyperprior, either a constant or a random variable} \\
\beta &\sim \text{a Dirichlet hyperprior, either a constant or a random variable} \\
\theta_{d=1\dots M} &\sim \operatorname{Dirichlet}_K(\alpha) \\
\phi_{k=1\dots K} &\sim \operatorname{Dirichlet}_V(\beta) \\
z_{d=1\dots M,\,n=1\dots N_d} &\sim \operatorname{Categorical}_K(\theta_d) \\
w_{d=1\dots M,\,n=1\dots N_d} &\sim \operatorname{Categorical}_V(\phi_{z_{dn}})
\end{aligned}
$$
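As a concrete illustration, here is a minimal NumPy sketch of the smoothed generative process. The symmetric hyperparameter values and all variable names are illustrative; the hyperprior on ϕ is called `eta` here to avoid clashing with the β matrix of the original version.

```python
import numpy as np

rng = np.random.default_rng(0)

K, V, M = 3, 20, 5            # topics, vocabulary size, documents
N = [50] * M                  # document lengths
alpha, eta = 0.1, 0.01        # symmetric Dirichlet hyperparameters (illustrative values)

# One Dirichlet_V draw per topic: the topic-word distributions (phi_k in the smoothed model)
phi = rng.dirichlet(np.full(V, eta), size=K)           # shape (K, V)

docs = []
for d in range(M):
    theta_d = rng.dirichlet(np.full(K, alpha))         # Dirichlet_K: topic mixture of document d
    z_d = rng.choice(K, size=N[d], p=theta_d)          # Categorical_K: topic of each token
    w_d = [int(rng.choice(V, p=phi[z])) for z in z_d]  # Categorical_V: observed words
    docs.append(w_d)
```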
In the original version, the joint distribution is:
$$
p(\theta, z, w \mid \alpha, \beta) = p(\theta \mid \alpha)\prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)
$$
The marginal distribution of a document:
$$
p(w \mid \alpha, \beta) = \int p(\theta \mid \alpha)\left(\prod_{n=1}^{N}\sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)\right) d\theta
$$
Posterior distribution of hidden variables:
$$
p(\theta, z \mid w, \alpha, \beta) = \frac{p(\theta, z, w \mid \alpha, \beta)}{p(w \mid \alpha, \beta)}
$$
This posterior is intractable to compute, due to the coupling between θ and β.
Variational distribution
Here we introduce a variational distribution over the latent variables, which factorizes as:
$$
q(\theta, z \mid \gamma, \phi) = q(\theta \mid \gamma)\prod_{n=1}^{N} q(z_n \mid \phi_n)
$$
and optimize:
$$
(\gamma^*, \phi^*) = \operatorname*{arg\,min}_{\gamma,\phi}\; D_{\mathrm{KL}}\big(q(\theta, z \mid \gamma, \phi)\,\|\, p(\theta, z \mid w, \alpha, \beta)\big)
$$
to minimize the difference between the variational distribution and the true posterior distribution.
Manipulating the objective above, we get:
$$
D_{\mathrm{KL}}(q \,\|\, p) + \mathcal{L}(\gamma, \phi; \alpha, \beta) = \log p(w \mid \alpha, \beta) = \text{constant w.r.t. } \gamma, \phi
$$
where
$$
\mathcal{L}(\gamma, \phi; \alpha, \beta) = \mathbb{E}_q[\log p(\theta, z, w \mid \alpha, \beta)] - \mathbb{E}_q[\log q(\theta, z)]
$$
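This identity can be checked directly: since log p(w|α,β) does not depend on θ or z, we may take its expectation under q and split the log ratio (a standard decomposition, spelled out here for completeness):

$$
\begin{aligned}
\log p(w \mid \alpha, \beta)
&= \mathbb{E}_q\!\left[\log \frac{p(\theta, z, w \mid \alpha, \beta)}{p(\theta, z \mid w, \alpha, \beta)}\right] \\
&= \mathbb{E}_q\!\left[\log \frac{p(\theta, z, w \mid \alpha, \beta)}{q(\theta, z)}\right]
 + \mathbb{E}_q\!\left[\log \frac{q(\theta, z)}{p(\theta, z \mid w, \alpha, \beta)}\right] \\
&= \mathcal{L}(\gamma, \phi; \alpha, \beta) + D_{\mathrm{KL}}(q \,\|\, p)
\end{aligned}
$$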
So minimizing the KL divergence is equivalent to maximizing the function L, which is a lower bound on log p(w|α,β):
$$
(\gamma^*, \phi^*) = \operatorname*{arg\,max}_{\gamma,\phi}\; \mathcal{L}(\gamma, \phi; \alpha, \beta)
$$
Parameter estimation
Variational EM algorithm:
- E-step: maximize the lower bound L(γ,ϕ;α,β) with respect to the variational parameters γ and ϕ
- M-step: maximize the bound with respect to the model parameters α and β (the full loop is sketched in code right after this list)
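A high-level sketch of this loop in Python (NumPy), assuming the `e_step`, `m_step_beta`, and `m_step_alpha` helpers sketched in the subsections below; all names, initializations, and the iteration count are illustrative:

```python
import numpy as np

def variational_em(docs, K, V, n_iter=50, seed=0):
    """docs[d]: list of word ids in [0, V).  Returns estimated (alpha, beta)."""
    rng = np.random.default_rng(seed)
    alpha = np.full(K, 1.0)                            # Dirichlet prior over topic mixtures
    beta = rng.dirichlet(np.full(V, 1.0), size=K)      # K x V topic-word distributions

    for _ in range(n_iter):
        # E-step: fit variational parameters (gamma_d, phi_d) for every document
        gammas, phis = zip(*(e_step(doc, alpha, beta) for doc in docs))
        # M-step: closed-form update for beta, Newton updates for alpha
        beta = m_step_beta(docs, phis, K, V)
        alpha = m_step_alpha(alpha, np.stack(gammas))
    return alpha, beta
```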
E-step: variational inference
A few more steps:
$$
\begin{aligned}
\mathcal{L}(\gamma, \phi; \alpha, \beta)
&= \mathbb{E}_q[\log p(\theta, z, w \mid \alpha, \beta)] - \mathbb{E}_q[\log q(\theta, z)] \\
&= \mathbb{E}_q[\log p(\theta \mid \alpha)] + \mathbb{E}_q[\log p(z \mid \theta)] + \mathbb{E}_q[\log p(w \mid z, \beta)]
 - \mathbb{E}_q[\log q(\theta)] - \mathbb{E}_q[\log q(z)]
\end{aligned}
$$
Struggling through heavy math to compute each term, we finally get (ψ is the digamma function):
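For reference, the expanded per-document bound (as derived in the original LDA paper by Blei, Ng, and Jordan, 2003) is:

$$
\begin{aligned}
\mathcal{L}(\gamma, \phi; \alpha, \beta)
&= \log\Gamma\Big(\sum_{j=1}^{K}\alpha_j\Big) - \sum_{i=1}^{K}\log\Gamma(\alpha_i)
 + \sum_{i=1}^{K}(\alpha_i - 1)\Big(\psi(\gamma_i) - \psi\Big(\sum_{j=1}^{K}\gamma_j\Big)\Big) \\
&\quad + \sum_{n=1}^{N}\sum_{i=1}^{K}\phi_{ni}\Big(\psi(\gamma_i) - \psi\Big(\sum_{j=1}^{K}\gamma_j\Big)\Big)
 + \sum_{n=1}^{N}\sum_{i=1}^{K}\sum_{j=1}^{V}\phi_{ni}\, w_n^j \log\beta_{ij} \\
&\quad - \log\Gamma\Big(\sum_{j=1}^{K}\gamma_j\Big) + \sum_{i=1}^{K}\log\Gamma(\gamma_i)
 - \sum_{i=1}^{K}(\gamma_i - 1)\Big(\psi(\gamma_i) - \psi\Big(\sum_{j=1}^{K}\gamma_j\Big)\Big) \\
&\quad - \sum_{n=1}^{N}\sum_{i=1}^{K}\phi_{ni}\log\phi_{ni}
\end{aligned}
$$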
Taking derivatives of this function and setting them to zero yields the update formulas.
The variational inference algorithm updates γ and ϕ alternately until convergence:
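The standard per-document updates (again following Blei, Ng, and Jordan, 2003) are

$$
\phi_{ni} \;\propto\; \beta_{i w_n}\,\exp\!\Big(\psi(\gamma_i) - \psi\Big(\sum_{j=1}^{K}\gamma_j\Big)\Big),
\qquad
\gamma_i = \alpha_i + \sum_{n=1}^{N}\phi_{ni}
$$

and a minimal NumPy sketch of this E-step follows; the function name `e_step`, the initialization, and the convergence threshold are illustrative:

```python
import numpy as np
from scipy.special import psi  # digamma

def e_step(doc, alpha, beta, tol=1e-4, max_iter=100):
    """Variational inference for one document.
    doc: list of word ids; alpha: (K,); beta: (K, V), rows sum to 1."""
    N, K = len(doc), len(alpha)
    phi = np.full((N, K), 1.0 / K)                 # q(z_n) for each token
    gamma = alpha + N / K                          # initial gamma_i = alpha_i + N/K
    for _ in range(max_iter):
        # phi_{ni} ∝ beta_{i, w_n} * exp(psi(gamma_i) - psi(sum_j gamma_j));
        # the psi(sum_j gamma_j) term is constant over i and cancels in the normalization
        log_phi = np.log(beta[:, doc].T + 1e-100) + psi(gamma)
        phi = np.exp(log_phi - log_phi.max(axis=1, keepdims=True))
        phi /= phi.sum(axis=1, keepdims=True)
        # gamma_i = alpha_i + sum_n phi_{ni}
        new_gamma = alpha + phi.sum(axis=0)
        if np.abs(new_gamma - gamma).sum() < tol:
            gamma = new_gamma
            break
        gamma = new_gamma
    return gamma, phi
```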
M-step
Maximize L(γ,ϕ;α,β) with respect to β, with Lagrange multipliers λi enforcing that each row of β sums to one:
$$
\mathcal{L}_{\beta} = \sum_{d=1}^{M}\sum_{n=1}^{N_d}\sum_{i=1}^{K}\sum_{j=1}^{V}\phi_{dni}\, w_{dn}^{j}\log\beta_{ij} + \sum_{i=1}^{K}\lambda_i\Big(\sum_{j=1}^{V}\beta_{ij} - 1\Big)
$$
Taking the derivative with respect to βij and setting it to zero:
$$
\beta_{ij} \propto \sum_{d=1}^{M}\sum_{n=1}^{N_d}\phi_{dni}\, w_{dn}^{j}
$$
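A minimal sketch of this update, accumulating the expected word-topic counts from the E-step outputs and normalizing each row of β; the function name `m_step_beta` is illustrative:

```python
import numpy as np

def m_step_beta(docs, phis, K, V, eps=1e-12):
    """docs[d]: list of word ids; phis[d]: (N_d, K) variational phi from the E-step."""
    counts = np.zeros((K, V))
    for doc, phi in zip(docs, phis):
        for n, w in enumerate(doc):
            counts[:, w] += phi[n]      # beta_{ij} accumulates phi_{dni} * w_{dn}^j
    return counts / (counts.sum(axis=1, keepdims=True) + eps)   # normalize rows to sum to 1
```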
Maximize L(γ,ϕ;α,β) with respect to α:
$$
\mathcal{L}_{\alpha} = \sum_{d=1}^{M}\left[\log\Gamma\Big(\sum_{j=1}^{K}\alpha_j\Big) - \sum_{i=1}^{K}\log\Gamma(\alpha_i) + \sum_{i=1}^{K}(\alpha_i - 1)\Big(\psi(\gamma_{di}) - \psi\Big(\sum_{j=1}^{K}\gamma_{dj}\Big)\Big)\right]
$$
Taking the derivative with respect to αi:
$$
\frac{\partial \mathcal{L}}{\partial \alpha_i} = M\Big(\psi\Big(\sum_{j=1}^{K}\alpha_j\Big) - \psi(\alpha_i)\Big) + \sum_{d=1}^{M}\Big(\psi(\gamma_{di}) - \psi\Big(\sum_{j=1}^{K}\gamma_{dj}\Big)\Big)
$$
It is difficult to compute αi in closed form by setting the derivative to zero, so we compute the Hessian matrix (δ(i,j)=1 if i=j and 0 otherwise):
$$
\frac{\partial^2 \mathcal{L}}{\partial \alpha_i\,\partial \alpha_j} = M\Big(\psi'\Big(\sum_{k=1}^{K}\alpha_k\Big) - \delta(i,j)\,\psi'(\alpha_i)\Big)
$$
and feed this Hessian matrix and the gradient into Newton's method to obtain α.
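A minimal sketch of these Newton updates, using the γ matrix collected in the E-step. The generic `np.linalg.solve` call ignores the special diagonal-plus-constant structure of the Hessian that Blei et al. exploit for a linear-time update, and the positivity safeguard is a simplification; the function name `m_step_alpha` is illustrative:

```python
import numpy as np
from scipy.special import psi, polygamma

def m_step_alpha(alpha, gammas, n_iter=20):
    """Newton updates for alpha.  alpha: (K,); gammas: (M, K) from the E-step."""
    M, K = gammas.shape
    # Sufficient statistic: sum_d ( psi(gamma_di) - psi(sum_j gamma_dj) )
    ss = (psi(gammas) - psi(gammas.sum(axis=1, keepdims=True))).sum(axis=0)
    for _ in range(n_iter):
        # Gradient:  g_i = M (psi(sum_j alpha_j) - psi(alpha_i)) + ss_i
        g = M * (psi(alpha.sum()) - psi(alpha)) + ss
        # Hessian:   H_ij = M (psi'(sum_j alpha_j) - delta(i,j) psi'(alpha_i))
        H = M * (polygamma(1, alpha.sum()) - np.diag(polygamma(1, alpha)))
        # Newton step for maximization: alpha <- alpha - H^{-1} g
        alpha = alpha - np.linalg.solve(H, g)
        # crude safeguard to keep alpha positive; real implementations
        # use a backtracking line search or update in log space
        alpha = np.maximum(alpha, 1e-6)
    return alpha
```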
Gibbs Sampling
(for the smoothed version)
Theoretical analysis:
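With θ and ϕ integrated out (collapsed Gibbs sampling), the full conditional of a single topic assignment takes the standard form (see, e.g., Darling, 2011); writing n⁻ for counts that exclude the token currently being resampled:

$$
p(z_{dn} = k \mid z^{-}, w, \alpha, \beta) \;\propto\;
\big(n^{-}_{d,k} + \alpha_k\big)\cdot
\frac{n^{-}_{k,w_{dn}} + \beta_{w_{dn}}}{\sum_{v=1}^{V}\big(n^{-}_{k,v} + \beta_v\big)}
$$

where n_{d,k} counts the tokens of document d assigned to topic k and n_{k,v} counts the tokens of word v assigned to topic k across the corpus.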
Both factors take this simple form thanks to the conjugate Dirichlet priors. Note that the normalizer of the first term is omitted, because its sum involves only the length of the document, which is fixed, while the second denominator may change after each update.
Algorithm:
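A minimal NumPy sketch of the collapsed Gibbs sampler, assuming symmetric scalar hyperparameters alpha and beta; the function name, default values, and fixed iteration count are illustrative:

```python
import numpy as np

def gibbs_lda(docs, K, V, alpha=0.1, beta=0.01, n_iter=200, seed=0):
    """Collapsed Gibbs sampler.  docs[d]: list of word ids in [0, V)."""
    rng = np.random.default_rng(seed)
    n_dk = np.zeros((len(docs), K))   # tokens in document d assigned to topic k
    n_kv = np.zeros((K, V))           # tokens of word v assigned to topic k
    n_k = np.zeros(K)                 # total tokens assigned to topic k
    z = [rng.integers(K, size=len(doc)) for doc in docs]  # random initial assignments

    for d, doc in enumerate(docs):    # fill in the count tables
        for n, w in enumerate(doc):
            n_dk[d, z[d][n]] += 1
            n_kv[z[d][n], w] += 1
            n_k[z[d][n]] += 1

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                k = z[d][n]
                # remove the current token from the counts
                n_dk[d, k] -= 1
                n_kv[k, w] -= 1
                n_k[k] -= 1
                # p(z = k | rest) ∝ (n_dk + alpha) * (n_kv + beta) / (n_k + V * beta)
                p = (n_dk[d] + alpha) * (n_kv[:, w] + beta) / (n_k + V * beta)
                k = rng.choice(K, p=p / p.sum())
                # add it back under the newly sampled topic
                z[d][n] = k
                n_dk[d, k] += 1
                n_kv[k, w] += 1
                n_k[k] += 1
    return z, n_dk, n_kv
```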
For comparisons and discussion of MCMC versus Variational Bayes, see Variational Bayes.
Extensions
Relaxing the assumptions:
- order of words doesn't matter ("Bag of words" assumption)
- order of documents doesn't matter => time-evolving, dynamic topic model
- the number of topics is assumed known, fixed and flat => Bayesian nonparametric topic model
- topics are not correlated => correlated topic model
- ...
Incorporating metadata: authors, links, other labels (supervised), ...
Other problems: model checking, visualization, data discovery, ...
References
Blei, David M. "Probabilistic topic models." Communications of the ACM 55.4 (2012): 77-84.
"Machine Learning" Lecture 19: http://www.umiacs.umd.edu/~jbg/teaching/CSCI_5622/
"Probabilistic Models for Unsupervised Learning" Lecture 5: http://home.cse.ust.hk/~lzhang/teach/6931a/
Dirichlet-Multinomial Distribution: https://en.wikipedia.org/wiki/Dirichlet-multinomial_distribution#A_combined_example:_LDA_topic_models
Darling, William M. "A theoretical and practical implementation tutorial on topic modeling and gibbs sampling." Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies. 2011.